confusion network
Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM
Prakash, Jeena, Kumar, Blessingh, Hacioglu, Kadri, Sharma, Bidisha, Gopalan, Sindhuja, Chetlur, Malolan, Venkatesan, Shankar, Stolcke, Andreas
Automatic speech recognition (ASR) models rely on high-quality transcribed data for effective training. Generating pseudo-labels for large unlabeled audio datasets often relies on complex pipelines that combine multiple ASR outputs through multi-stage processing, leading to error propagation, information loss and disjoint optimization. We propose a unified multi-ASR prompt-driven framework using postprocessing by either textual or speech-based large language models (LLMs), replacing voting or other arbitration logic for reconciling the ensemble outputs. We perform a comparative study of multiple architectures with and without LLMs, showing significant improvements in transcription accuracy compared to traditional methods. Furthermore, we use the pseudo-labels generated by the various approaches to train semi-supervised ASR models for different datasets, again showing improved performance with textual and speechLLM transcriptions compared to baselines.
SoftCTC -- Semi-Supervised Learning for Text Recognition using Soft Pseudo-Labels
Kišš, Martin, Hradiš, Michal, Beneš, Karel, Buchal, Petr, Kula, Michal
This paper explores semi-supervised training for sequence tasks, such as Optical Character Recognition or Automatic Speech Recognition. We propose a novel loss function, SoftCTC, which is an extension of CTC that allows multiple transcription variants to be considered at the same time. This makes it possible to omit the confidence-based filtering step, which is otherwise a crucial component of pseudo-labeling approaches to semi-supervised learning. We demonstrate the effectiveness of our method on a challenging handwriting recognition task and conclude that SoftCTC matches the performance of a finely tuned filtering-based pipeline. We also evaluated SoftCTC in terms of computational efficiency, concluding that it is significantly more efficient than a naïve CTC-based approach for training on multiple transcription variants, and we make our GPU implementation public.
Streaming Speech-to-Confusion Network Speech Recognition
Filimonov, Denis, Pandey, Prabhat, Rastrow, Ariya, Gandhe, Ankur, Stolcke, Andreas
In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
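To make the idea of second-pass rescoring over a confusion network concrete, here is a minimal sketch. It is not the rescoring procedure of the paper above: the confusion network layout (a list of slots, each mapping a word to a positive posterior), the `lm_logprob` callable, the `lm_weight` parameter, and the toy language model are all assumptions made for illustration. Paths are enumerated exhaustively, which is only feasible for small networks.

```python
import itertools
import math

EPS = "<eps>"  # assumed marker for "no word in this slot"


def rescore(confnet, lm_logprob, lm_weight=1.0):
    """Pick the best path through a confusion network.

    confnet: list of slots; each slot is a dict word -> posterior (> 0).
    lm_logprob: hypothetical second-pass scorer, list of words -> log-probability.
    Returns (best_words, best_score).
    """
    best_words, best_score = None, -math.inf
    # Exhaustive enumeration: one (word, posterior) choice per slot.
    for path in itertools.product(*(slot.items() for slot in confnet)):
        words = [w for w, _ in path if w != EPS]
        first_pass = sum(math.log(p) for _, p in path)
        total = first_pass + lm_weight * lm_logprob(words)
        if total > best_score:
            best_words, best_score = words, total
    return best_words, best_score


def toy_lm(words):
    # Toy stand-in for a real language model: prefers "there is" over "their is".
    return -0.1 if words and words[0] == "there" else -3.0


confnet = [{"their": 0.6, "there": 0.4}, {"is": 1.0}]
one_best, _ = rescore(confnet, toy_lm, lm_weight=0.0)  # first pass only
rescored, _ = rescore(confnet, toy_lm, lm_weight=1.0)  # with second pass
```

With `lm_weight=0.0` the result is simply the 1-best path by first-pass posterior; with the toy language model switched on, the lower-posterior alternative wins. A practical system would replace the exhaustive loop with a beam or n-best extraction.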
Hystoc: Obtaining word confidences for fusion of end-to-end ASR systems
Beneš, Karel, Kocour, Martin, Burget, Lukáš
End-to-end (e2e) systems have recently gained wide popularity in automatic speech recognition. However, these systems generally do not provide well-calibrated word-level confidences. In this paper, we propose Hystoc, a simple method for obtaining word-level confidences from hypothesis-level scores. Hystoc is an iterative alignment procedure which turns hypotheses from an n-best output of the ASR system into a confusion network. Word-level confidences are then obtained as posterior probabilities in the individual bins of the confusion network. We show that Hystoc provides confidences that correlate well with the accuracy of the ASR hypothesis. Furthermore, we show that utilizing Hystoc in fusion of multiple e2e ASR systems increases the gains from the fusion by up to 1% absolute WER on the Spanish RTVE2020 dataset. Finally, we experiment with using Hystoc for direct fusion of n-best outputs from multiple systems, but we only achieve minor gains when fusing very similar systems.
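The last step described above, reading word confidences off the bins of a confusion network, can be sketched as follows. This is not the Hystoc implementation: the iterative alignment is skipped entirely, and the sketch assumes the n-best hypotheses are already aligned to equal length (padded with an assumed `<eps>` token) and that hypothesis scores are log-domain values normalized with a softmax. The function name `bin_posteriors` is hypothetical.

```python
import math
from collections import defaultdict

EPS = "<eps>"  # assumed padding token for "no word in this bin"


def bin_posteriors(aligned_hyps, scores):
    """Turn pre-aligned n-best hypotheses into per-bin word posteriors.

    aligned_hyps: list of token lists, all of the same length.
    scores: one log-domain score per hypothesis.
    Returns a list of dicts (one per bin) mapping word -> posterior.
    """
    n_bins = len(aligned_hyps[0])
    if any(len(h) != n_bins for h in aligned_hyps):
        raise ValueError("hypotheses must be aligned to the same length")
    # Softmax over hypothesis scores (shifted by the max for stability).
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    hyp_posts = [w / total for w in weights]
    # Each hypothesis votes for its word in every bin with its posterior.
    bins = [defaultdict(float) for _ in range(n_bins)]
    for hyp, post in zip(aligned_hyps, hyp_posts):
        for bin_, word in zip(bins, hyp):
            bin_[word] += post
    return [dict(b) for b in bins]


hyps = [["the", "cat", "sat"], ["the", "cat", "sad"], ["a", "cat", "sat"]]
scores = [math.log(0.5), math.log(0.3), math.log(0.2)]
bins = bin_posteriors(hyps, scores)
# bins[0] is approximately {"the": 0.8, "a": 0.2}; bins[1] gives "cat" about 1.0
```

The confidence of a word in the final output is then the posterior of that word in its bin, so words on which all hypotheses agree receive a confidence close to 1.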
Transformer-based encoder-encoder architecture for Spoken Term Detection
Švec, Jan, Šmídl, Luboš, Lehečka, Jan
The paper presents a method for spoken term detection based on the Transformer architecture. We propose the encoder-encoder architecture employing two BERT-like encoders with additional modifications, including convolutional and upsampling layers, attention masking, and shared parameters. The encoders project a recognized hypothesis and a searched term into a shared embedding space, where the score of the putative hit is computed using the calibrated dot product. In the experiments, we used the Wav2Vec 2.0 speech recognizer, and the proposed system outperformed a baseline method based on deep LSTMs on the English and Czech STD datasets based on USC Shoah Foundation Visual History Archive (MALACH).
In this work, we do not focus on the direct processing of the input speech signal. Instead, we use the speech recognizer to convert an audio signal into a graphemic recognition hypothesis. The representation of speech at the grapheme level allows preprocessing the input audio into a compact confusion network and further to a sequence of embedding vectors. In [7], we proposed a Deep LSTM architecture for spoken term detection, which uses the projection of both the input speech and searched term into a shared embedding space. The hybrid DNN-HMM speech recognizer produced phoneme confusion networks representing the input speech. The DNN-HMM speech recognizer can be replaced with the Wav2Vec
Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings
Švec, Jan, Šmídl, Luboš, Psutka, Josef V., Pražák, Aleš
The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by the deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech languages.
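The scoring step described above (dot product in the embedding space, sigmoid calibration, localization from the probability sequence) can be sketched in a few lines. This is an illustration under assumptions, not the authors' system: the embeddings are plain lists of floats standing in for LSTM outputs, the calibration parameters `scale` and `bias` are hypothetical, and localization by thresholding contiguous runs is an assumed strategy, since the abstract does not say how the location is estimated.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relevance_probs(segment_embs, term_emb, scale=1.0, bias=0.0):
    """Probability of term occurrence for each confusion-network segment.

    scale and bias are hypothetical calibration parameters.
    """
    probs = []
    for seg in segment_embs:
        score = sum(s * t for s, t in zip(seg, term_emb))  # dot product
        probs.append(sigmoid(scale * score + bias))
    return probs


def locate(probs, threshold=0.5):
    """Return (start, end) index pairs, end exclusive, of runs at or above threshold."""
    spans, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(probs)))
    return spans


segments = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
term = [1.0, 1.0]
probs = relevance_probs(segments, term)
hits = locate(probs, threshold=0.6)
```

In this toy input the dot products are 0, 2, 4 and -2, so only the second and third segments exceed the 0.6 threshold and are reported as a single putative hit.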
Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer
Švec, Jan, Lehečka, Jan, Šmídl, Luboš
In recent years, standard hybrid DNN-HMM speech recognizers have been outperformed by end-to-end speech recognition systems. One very promising approach is the grapheme-based Wav2Vec 2.0 model, which combines self-supervised pretraining with transfer learning of the fine-tuned speech recognizer. Since it lacks a pronunciation vocabulary and language model, the approach is suitable for tasks where obtaining such models is difficult or almost impossible. In this paper, we use the Wav2Vec speech recognizer in the task of spoken term detection over a large set of spoken documents. The method employs a deep LSTM network which maps the recognized hypothesis and the searched term into a shared pronunciation embedding space in which the term occurrences and the assigned scores are easily computed. The paper describes a bootstrapping approach that allows the transfer of the knowledge contained in the traditional pronunciation vocabulary of DNN-HMM hybrid ASR into the context of grapheme-based Wav2Vec. The proposed method outperforms the previously published system based on the combination of the DNN-HMM hybrid ASR and phoneme recognizer by a large margin on the MALACH data in both English and Czech.
Understanding Domain Specific Languages (CS)
Abstract: Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs.
ConfNet2Seq: Full Length Answer Generation from Spoken Questions
Pal, Vaishali, Shrivastava, Manish, Besacier, Laurent
Conversational and task-oriented dialogue systems aim to interact with the user using natural responses through multi-modal interfaces, such as text or speech. These desired responses are in the form of full-length natural answers generated over facts retrieved from a knowledge source. While the task of generating natural answers to questions from an answer span has been widely studied, there has been little research on natural sentence generation over spoken content. We propose a novel system to generate full-length natural language answers from spoken questions and factoid answers. The spoken sequence is compactly represented as a confusion network extracted from a pre-trained Automatic Speech Recognizer. To the best of our knowledge, this is the first attempt at generating full-length natural answers from a graph input (confusion network). We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers and corresponding full-length textual answers. Following our proposed approach, we achieve performance comparable to that of the best ASR hypothesis.
Word-Error Correction of Continuous Speech Recognition Based on Normalized Relevance Distance
Fusayasu, Yohei (Kobe University) | Tanaka, Katsuyuki (Kobe University) | Takiguchi, Tetsuya (Kobe University) | Ariki, Yasuo (Kobe University)
In spite of recent advancements in speech recognition, recognition errors are unavoidable in continuous speech recognition. In this paper, we focus on a word-error correction system for continuous speech recognition using confusion networks. Conventional N-gram correction is widely used; however, its performance degrades because the N-gram approach cannot measure information between long-distance words. In order to improve the performance of the N-gram model, we employ Normalized Relevance Distance (NRD) as a measure for semantic similarity between words. NRD can identify not only co-occurrence but also the correlation of importance of the terms in documents. Even if the words are located far from each other, NRD can estimate the semantic similarity between the words. The effectiveness of our method was evaluated in continuous speech recognition tasks for multiple test speakers. Experimental results show that our error-correction method is the most effective approach as compared to the methods using other features.